Root Mean Squared Error (RMSE) — Regression Metric (From Scratch)#

RMSE measures the typical size of prediction errors in the same units as the target. It is the square root of the mean squared error (MSE); because errors are squared before averaging, RMSE penalizes large errors more heavily than MAE.

Goals

  • Build intuition with small numeric examples + Plotly visuals

  • Derive the formula (and gradients) with clear notation

  • Implement RMSE in NumPy (from scratch) and validate vs scikit-learn

  • Use RMSE/MSE to fit a simple linear regression with gradient descent

  • Summarize pros/cons, good use cases, and common pitfalls

Quick import#

from sklearn.metrics import root_mean_squared_error

In scikit-learn < 1.4 the equivalent was mean_squared_error(..., squared=False); the squared parameter has since been deprecated and removed, so prefer root_mean_squared_error.

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(7)
import plotly
import sklearn

print("numpy  :", np.__version__)
print("pandas :", pd.__version__)
print("sklearn:", sklearn.__version__)
print("plotly :", plotly.__version__)
numpy  : 1.26.2
pandas : 2.1.3
sklearn: 1.6.0
plotly : 6.5.2

Prerequisites#

  • Basic regression setup: true targets \(y\) and predictions \(\hat y\)

  • Vector norms and means

  • (Optional) Basic derivatives for the gradient section

1) Definition and notation#

Let:

  • \(y \in \mathbb{R}^n\) be the true targets

  • \(\hat y \in \mathbb{R}^n\) be the predictions

  • residuals (signed errors) \(r_i = \hat y_i - y_i\)

The mean squared error (MSE) is:

\[ \mathrm{MSE}(y, \hat y) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{1}{n}\sum_{i=1}^n r_i^2 \]

The root mean squared error (RMSE) is:

\[ \mathrm{RMSE}(y, \hat y) = \sqrt{\mathrm{MSE}(y, \hat y)} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2} \]

Vector form#

With residual vector \(r = \hat y - y\):

\[ \mathrm{RMSE}(y, \hat y) = \frac{\lVert r \rVert_2}{\sqrt{n}} \]

Key interpretation: RMSE is a scaled Euclidean distance between the prediction vector and the target vector.
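A quick numeric check of the vector form (small illustrative arrays, not data from this notebook):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.1, 7.8])
r = y_hat - y

rmse_mean = np.sqrt(np.mean(r**2))                # definition: sqrt of mean squared residual
rmse_norm = np.linalg.norm(r) / np.sqrt(len(r))   # scaled Euclidean norm of the residual vector

assert np.isclose(rmse_mean, rmse_norm)
```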

Weighted RMSE#

For non-negative sample weights \(w_i\):

\[ \mathrm{RMSE}_w(y, \hat y; w) = \sqrt{\frac{\sum_{i=1}^n w_i (y_i - \hat y_i)^2}{\sum_{i=1}^n w_i}} \]
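The weighted formula translates directly to NumPy (toy arrays for illustration):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
w = np.array([1.0, 2.0, 3.0])  # non-negative sample weights

# weighted RMSE: weighted mean of squared errors, then square root
wrmse = np.sqrt(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

# with uniform weights this reduces to plain RMSE
uniform = np.sqrt(np.sum(np.ones_like(w) * (y_true - y_pred) ** 2) / len(w))
assert np.isclose(uniform, np.sqrt(np.mean((y_true - y_pred) ** 2)))
```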

Multioutput targets#

For \(y, \hat y \in \mathbb{R}^{n \times d}\), compute per-output RMSE:

\[ \mathrm{RMSE}_j = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{ij} - \hat y_{ij})^2}, \quad j=1,\dots,d \]

and then aggregate (uniform average by default, or a weighted average over outputs).

2) Intuition: “average error size” with extra penalty for big misses#

If you make one error twice as large, its squared contribution becomes four times as large.

That means RMSE:

  • has the same units as \(y\) (unlike MSE)

  • behaves like a “typical” error magnitude

  • is more sensitive to outliers than MAE because of the square

A common statistical view: if residuals are i.i.d. Gaussian with constant variance, minimizing MSE/RMSE is equivalent to maximum likelihood.
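Sketch of that equivalence: assume \(y_i = \hat y_i + \varepsilon_i\) with i.i.d. noise \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\). The log-likelihood of the data is

\[ \log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \hat y_i)^2 \]

For fixed \(\sigma\), maximizing \(\log L\) over the predictions means minimizing \(\sum_{i=1}^n (y_i - \hat y_i)^2\), i.e. minimizing MSE (and hence RMSE).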

3) A tiny worked example#

We’ll compute RMSE step-by-step on a small set of points.

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5, 0.0, 2.1, 7.8, 1.9])

residual = y_pred - y_true
abs_error = np.abs(residual)
sq_error = residual**2

df = pd.DataFrame(
    {
        "y_true": y_true,
        "y_pred": y_pred,
        "residual (y_pred - y_true)": residual,
        "|residual|": abs_error,
        "residual^2": sq_error,
    }
)

mse = float(np.mean(sq_error))
rmse = float(np.sqrt(mse))

rmse_sklearn = root_mean_squared_error(y_true, y_pred)

df, mse, rmse, rmse_sklearn
(   y_true  y_pred  residual (y_pred - y_true)  |residual|  residual^2
 0     3.0     2.5                        -0.5         0.5        0.25
 1    -0.5     0.0                         0.5         0.5        0.25
 2     2.0     2.1                         0.1         0.1        0.01
 3     7.0     7.8                         0.8         0.8        0.64
 4     4.2     1.9                        -2.3         2.3        5.29,
 1.288,
 1.1349008767288886,
 1.1349008767288886)

4) How squaring changes “importance” across samples#

MAE gives each sample a share of the total error proportional to \(|r_i|\).

MSE/RMSE gives each sample a share proportional to \(r_i^2\), so large residuals dominate more.

Below we plot the fraction of total error each sample contributes under MAE vs MSE/RMSE.

idx = np.arange(len(y_true))

mae_weights = abs_error / abs_error.sum()
mse_weights = sq_error / sq_error.sum()

fig = go.Figure()
fig.add_trace(go.Bar(x=idx, y=mae_weights, name="MAE weight  (|r| / sum|r|)"))
fig.add_trace(go.Bar(x=idx, y=mse_weights, name="MSE/RMSE weight (r^2 / sum r^2)"))
fig.update_layout(
    barmode="group",
    title="Per-sample influence under MAE vs RMSE",
    xaxis_title="sample index",
    yaxis_title="fraction of total error",
)
fig.show()

5) Outlier sensitivity: RMSE vs MAE#

We’ll keep all residuals fixed except one, and sweep that one residual from 0 to a large value.

Because RMSE squares residuals, it grows faster than MAE as the outlier grows.

n = 200
base_residuals = rng.normal(0.0, 1.0, size=n)
base_residuals[0] = 0.0

outlier_values = np.linspace(0, 15, 200)
rmse_vals = []
mae_vals = []

for v in outlier_values:
    r = base_residuals.copy()
    r[0] = v
    rmse_vals.append(float(np.sqrt(np.mean(r**2))))
    mae_vals.append(float(np.mean(np.abs(r))))

df_sweep = pd.DataFrame({"outlier |r|": outlier_values, "RMSE": rmse_vals, "MAE": mae_vals})

fig = px.line(
    df_sweep,
    x="outlier |r|",
    y=["RMSE", "MAE"],
    title="Single outlier sweep: RMSE grows faster than MAE",
    labels={"value": "metric value (same units as y)", "variable": "metric"},
)
fig.show()

6) Key property: constant predictor → mean#

If your model can only predict a constant \(c\) for every sample (\(\hat y_i = c\)), then:

\[ \mathrm{RMSE}(c) = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - c)^2} \]

Because the square-root is monotonic, the minimizer of RMSE\((c)\) is the same as the minimizer of:

\[ \mathrm{MSE}(c) = \frac{1}{n}\sum_{i=1}^n (y_i - c)^2 \]

Differentiate and set to zero:

\[ \frac{d}{dc}\mathrm{MSE}(c) = -\frac{2}{n}\sum_{i=1}^n (y_i - c) = 0 \quad\Rightarrow\quad c^* = \bar y \]

This is one reason RMSE is sensitive to outliers: an outlier pulls the mean (and hence the optimal constant) toward it.

In contrast, MAE is minimized by the median (more robust to outliers).

y_const = rng.normal(0.0, 1.0, size=60)
y_const[0] = 8.0  # add a clear outlier

c_grid = np.linspace(y_const.min() - 1, y_const.max() + 1, 400)
rmse_curve = np.sqrt(np.mean((y_const[None, :] - c_grid[:, None]) ** 2, axis=1))
mae_curve = np.mean(np.abs(y_const[None, :] - c_grid[:, None]), axis=1)

c_mean = float(np.mean(y_const))
c_median = float(np.median(y_const))
y_top = float(max(rmse_curve.max(), mae_curve.max()))

fig = go.Figure()
fig.add_trace(go.Scatter(x=c_grid, y=rmse_curve, mode="lines", name="RMSE(c)"))
fig.add_trace(go.Scatter(x=c_grid, y=mae_curve, mode="lines", name="MAE(c)"))
fig.add_vline(x=c_mean, line_dash="dash", line_color="#1f77b4")
fig.add_vline(x=c_median, line_dash="dash", line_color="#ff7f0e")
fig.add_annotation(x=c_mean, y=y_top, text="mean", showarrow=False, yshift=10, font=dict(color="#1f77b4"))
fig.add_annotation(
    x=c_median,
    y=y_top,
    text="median",
    showarrow=False,
    yshift=10,
    font=dict(color="#ff7f0e"),
)
fig.update_layout(
    title="Constant predictor: RMSE is minimized by the mean (MAE by the median)",
    xaxis_title="constant prediction c",
    yaxis_title="metric value",
)
fig.show()

7) NumPy implementation (from scratch)#

We’ll implement RMSE in a way that matches scikit-learn’s signature:

  • supports 1D or 2D targets (n_outputs)

  • optional sample_weight

  • multioutput: return per-output values (raw_values) or average (uniform_average)

def _as_2d(y):
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        return y.reshape(-1, 1)
    if y.ndim == 2:
        return y
    raise ValueError("y must be 1D or 2D (n_samples,) or (n_samples, n_outputs).")


def mse_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
    """Mean squared error with scikit-learn-like multioutput handling."""
    y_true_2d = _as_2d(y_true)
    y_pred_2d = _as_2d(y_pred)

    if y_true_2d.shape != y_pred_2d.shape:
        raise ValueError(f"shape mismatch: y_true{y_true_2d.shape} vs y_pred{y_pred_2d.shape}")

    residual = y_pred_2d - y_true_2d

    if sample_weight is None:
        mse_per_output = np.mean(residual**2, axis=0)
    else:
        w = np.asarray(sample_weight, dtype=float)
        if w.ndim != 1:
            raise ValueError("sample_weight must be 1D of shape (n_samples,).")
        if w.shape[0] != y_true_2d.shape[0]:
            raise ValueError("sample_weight length must match n_samples.")
        w = w.reshape(-1, 1)
        mse_per_output = np.sum(w * residual**2, axis=0) / np.sum(w, axis=0)

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return mse_per_output
        if multioutput == "uniform_average":
            return float(np.mean(mse_per_output))
        raise ValueError(
            "multioutput must be 'raw_values', 'uniform_average', or array-like of shape (n_outputs,)."
        )

    weights = np.asarray(multioutput, dtype=float)
    if weights.shape != (mse_per_output.shape[0],):
        raise ValueError("multioutput weights must match n_outputs.")
    return float(np.average(mse_per_output, weights=weights))


def rmse_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
    """Root mean squared error (RMSE): sqrt(mean((y_pred - y_true)^2))."""
    mse_per_output = mse_np(
        y_true,
        y_pred,
        sample_weight=sample_weight,
        multioutput="raw_values",
    )
    rmse_per_output = np.sqrt(mse_per_output)

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return rmse_per_output
        if multioutput == "uniform_average":
            return float(np.mean(rmse_per_output))
        raise ValueError(
            "multioutput must be 'raw_values', 'uniform_average', or array-like of shape (n_outputs,)."
        )

    weights = np.asarray(multioutput, dtype=float)
    if weights.shape != (rmse_per_output.shape[0],):
        raise ValueError("multioutput weights must match n_outputs.")
    return float(np.average(rmse_per_output, weights=weights))


y_true_rand = rng.normal(size=(50, 3))
y_pred_rand = y_true_rand + rng.normal(scale=0.5, size=y_true_rand.shape)

print("ours raw:", rmse_np(y_true_rand, y_pred_rand, multioutput="raw_values"))
print("sk   raw:", root_mean_squared_error(y_true_rand, y_pred_rand, multioutput="raw_values"))

sample_w = rng.uniform(0.5, 2.0, size=y_true_rand.shape[0])
print("ours weighted:", rmse_np(y_true_rand, y_pred_rand, sample_weight=sample_w))
print("sk   weighted:", root_mean_squared_error(y_true_rand, y_pred_rand, sample_weight=sample_w))
ours raw: [0.45323041 0.53601038 0.40367295]
sk   raw: [0.45323041 0.53601038 0.40367295]
ours weighted: 0.4660120349767279
sk   weighted: 0.4660120349767279

8) RMSE as an objective: gradients and optimization#

RMSE is often reported to humans because it has the same units as \(y\).

When training with gradient-based methods, people often minimize MSE instead:

  • RMSE and MSE have the same minimizer (square-root is monotonic)

  • the MSE gradient is simpler and avoids dividing by RMSE

Still, we can differentiate RMSE and optimize it directly; the next section shows both.

8.1 Gradients#

Let \(r_i = \hat y_i - y_i\) and

\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n r_i^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}} \]

Gradient w.r.t. predictions:

\[ \frac{\partial\,\mathrm{MSE}}{\partial \hat y_i} = \frac{2}{n} r_i \]

If RMSE \(> 0\):

\[ \frac{\partial\,\mathrm{RMSE}}{\partial \hat y_i} = \frac{1}{2\sqrt{\mathrm{MSE}}}\cdot \frac{\partial\,\mathrm{MSE}}{\partial \hat y_i} = \frac{r_i}{n\,\mathrm{RMSE}} \]

So (for RMSE \(> 0\)) the gradients point in the same direction; they differ by a scalar factor \(\frac{1}{2\,\mathrm{RMSE}}\).

For a linear model \(\hat y_i = w x_i + b\), the parameter gradients are:

\[ \frac{\partial\,\mathrm{MSE}}{\partial w} = \frac{2}{n}\sum_{i=1}^n r_i x_i, \qquad \frac{\partial\,\mathrm{MSE}}{\partial b} = \frac{2}{n}\sum_{i=1}^n r_i \]
\[ \frac{\partial\,\mathrm{RMSE}}{\partial w} = \frac{1}{n\,\mathrm{RMSE}}\sum_{i=1}^n r_i x_i, \qquad \frac{\partial\,\mathrm{RMSE}}{\partial b} = \frac{1}{n\,\mathrm{RMSE}}\sum_{i=1}^n r_i \]

(At a perfect fit where RMSE \(= 0\), the derivative is undefined; in code we guard the denominator with a tiny \(\varepsilon\).)
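A quick numeric sanity check of the relation between the two gradients, using the formulas above on small random arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)
y_hat = y + rng.normal(scale=0.3, size=10)
r = y_hat - y
n = len(r)

rmse = np.sqrt(np.mean(r**2))
grad_mse = (2.0 / n) * r        # d MSE  / d y_hat_i
grad_rmse = r / (n * rmse)      # d RMSE / d y_hat_i

# same direction, differing only by the scalar 1 / (2 * RMSE)
assert np.allclose(grad_rmse, grad_mse / (2.0 * rmse))
```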

n = 200
x = rng.uniform(-3, 3, size=n)
true_w, true_b = 1.8, -0.7
y = true_w * x + true_b + rng.normal(0, 0.8, size=n)

perm = rng.permutation(n)
train_size = int(0.8 * n)
train_idx = perm[:train_size]
test_idx = perm[train_size:]

x_tr, y_tr = x[train_idx], y[train_idx]
x_te, y_te = x[test_idx], y[test_idx]

x_tr.shape, x_te.shape
((160,), (40,))
df_data = pd.DataFrame(
    {
        "x": np.concatenate([x_tr, x_te]),
        "y": np.concatenate([y_tr, y_te]),
        "split": np.array(["train"] * len(x_tr) + ["test"] * len(x_te)),
    }
)

fig = px.scatter(
    df_data,
    x="x",
    y="y",
    color="split",
    title="Synthetic 1D regression data",
    labels={"x": "feature x", "y": "target y"},
)
fig.show()

9) Using RMSE/MSE to fit linear regression (from scratch)#

Model (one feature):

\[ \hat y = w x + b \]

We’ll fit \((w, b)\) with gradient descent using either:

  • objective = MSE

  • objective = RMSE

def predict_linear(x, w, b):
    x = np.asarray(x, dtype=float)
    return w * x + b


def fit_linear_gd(x, y, *, lr=0.05, steps=200, objective="mse"):
    """Fit y ≈ w x + b with gradient descent (objective: 'mse' or 'rmse')."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    w = 0.0
    b = 0.0
    n = x.shape[0]

    hist = {"mse": [], "rmse": [], "w": [], "b": []}

    for _ in range(steps):
        y_hat = predict_linear(x, w, b)
        r = y_hat - y

        mse = float(np.mean(r**2))
        rmse = float(np.sqrt(mse))

        if objective == "mse":
            grad_w = (2.0 / n) * float(np.dot(r, x))
            grad_b = (2.0 / n) * float(np.sum(r))
        elif objective == "rmse":
            denom = max(rmse, 1e-12)
            grad_w = (1.0 / (n * denom)) * float(np.dot(r, x))
            grad_b = (1.0 / (n * denom)) * float(np.sum(r))
        else:
            raise ValueError("objective must be 'mse' or 'rmse'.")

        w -= lr * grad_w
        b -= lr * grad_b

        hist["mse"].append(mse)
        hist["rmse"].append(rmse)
        hist["w"].append(w)
        hist["b"].append(b)

    return w, b, hist
w_mse, b_mse, hist_mse = fit_linear_gd(x_tr, y_tr, lr=0.05, steps=200, objective="mse")
w_rmse, b_rmse, hist_rmse = fit_linear_gd(x_tr, y_tr, lr=0.05, steps=200, objective="rmse")

results = pd.DataFrame(
    {
        "objective": ["mse", "rmse"],
        "w": [w_mse, w_rmse],
        "b": [b_mse, b_rmse],
        "train_rmse": [
            rmse_np(y_tr, predict_linear(x_tr, w_mse, b_mse)),
            rmse_np(y_tr, predict_linear(x_tr, w_rmse, b_rmse)),
        ],
        "test_rmse": [
            rmse_np(y_te, predict_linear(x_te, w_mse, b_mse)),
            rmse_np(y_te, predict_linear(x_te, w_rmse, b_rmse)),
        ],
    }
)

results
  objective         w         b  train_rmse  test_rmse
0       mse  1.846541 -0.591784    0.749457   0.674895
1      rmse  1.846541 -0.591783    0.749457   0.674896
df_hist = pd.DataFrame(
    {
        "iteration": np.arange(1, len(hist_mse["rmse"]) + 1),
        "rmse (objective=mse)": hist_mse["rmse"],
        "rmse (objective=rmse)": hist_rmse["rmse"],
    }
)

fig = px.line(
    df_hist,
    x="iteration",
    y=["rmse (objective=mse)", "rmse (objective=rmse)"],
    title="Training curve (RMSE on the training set)",
    labels={"value": "RMSE", "variable": "training objective"},
)
fig.show()
x_line = np.linspace(x.min(), x.max(), 200)

fig = go.Figure()
fig.add_trace(go.Scatter(x=x_tr, y=y_tr, mode="markers", name="train"))
fig.add_trace(go.Scatter(x=x_te, y=y_te, mode="markers", name="test"))
fig.add_trace(
    go.Scatter(
        x=x_line,
        y=predict_linear(x_line, w_mse, b_mse),
        mode="lines",
        name="fit (objective=mse)",
    )
)
fig.add_trace(
    go.Scatter(
        x=x_line,
        y=predict_linear(x_line, w_rmse, b_rmse),
        mode="lines",
        name="fit (objective=rmse)",
        line=dict(dash="dash"),
    )
)
fig.update_layout(
    title="Linear regression fits (same optimum, different objective scaling)",
    xaxis_title="x",
    yaxis_title="y",
)
fig.show()

10) Sanity check: closed-form least squares and scikit-learn#

For linear regression with squared error, the optimum has a closed-form (least squares) solution.

We’ll compare:

  • gradient descent (above)

  • NumPy least squares (np.linalg.lstsq)

  • sklearn.linear_model.LinearRegression

A = np.column_stack([x_tr, np.ones_like(x_tr)])
w_ls, b_ls = np.linalg.lstsq(A, y_tr, rcond=None)[0]

rmse_train_ls = rmse_np(y_tr, predict_linear(x_tr, w_ls, b_ls))
rmse_test_ls = rmse_np(y_te, predict_linear(x_te, w_ls, b_ls))

pd.DataFrame(
    {
        "method": ["least_squares"],
        "w": [w_ls],
        "b": [b_ls],
        "train_rmse": [rmse_train_ls],
        "test_rmse": [rmse_test_ls],
    }
)
          method         w         b  train_rmse  test_rmse
0  least_squares  1.846541 -0.591784    0.749457   0.674895
sk_model = LinearRegression().fit(x_tr.reshape(-1, 1), y_tr)
w_sk = float(sk_model.coef_[0])
b_sk = float(sk_model.intercept_)

rmse_train_sk = root_mean_squared_error(y_tr, sk_model.predict(x_tr.reshape(-1, 1)))
rmse_test_sk = root_mean_squared_error(y_te, sk_model.predict(x_te.reshape(-1, 1)))

pd.DataFrame(
    {
        "method": ["sklearn"],
        "w": [w_sk],
        "b": [b_sk],
        "train_rmse": [rmse_train_sk],
        "test_rmse": [rmse_test_sk],
    }
)
    method         w         b  train_rmse  test_rmse
0  sklearn  1.846541 -0.591784    0.749457   0.674895

11) Practical usage notes (scikit-learn)#

  • For reporting: root_mean_squared_error(y_true, y_pred).

  • For cross-validation: scorers are usually higher-is-better, so scikit-learn uses "neg_root_mean_squared_error".

  • For multi-output regression: set multioutput="raw_values" to get one RMSE per output.

  • For weighted datasets: pass sample_weight.
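For the cross-validation bullet, a minimal sketch on synthetic data (the scorer name "neg_root_mean_squared_error" is a real scikit-learn scorer; the data-generating setup is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 1.8 * X[:, 0] - 0.7 + rng.normal(0, 0.8, size=100)

# scorers are higher-is-better, so scikit-learn returns negated RMSE per fold
neg_scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_root_mean_squared_error", cv=5
)
rmse_per_fold = -neg_scores
print("RMSE per fold:", rmse_per_fold)
print("mean RMSE    :", rmse_per_fold.mean())
```

Negate the returned scores to recover RMSE in the target's units.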

y_true_2 = np.column_stack([y_true, 2 * y_true])
y_pred_2 = np.column_stack([y_pred, 2 * y_pred + 0.2])

print("ours  raw:", rmse_np(y_true_2, y_pred_2, multioutput="raw_values"))
print("sk    raw:", root_mean_squared_error(y_true_2, y_pred_2, multioutput="raw_values"))

output_weights = np.array([0.25, 0.75])
print("ours weighted outputs:", rmse_np(y_true_2, y_pred_2, multioutput=output_weights))
print("sk   weighted outputs:", root_mean_squared_error(y_true_2, y_pred_2, multioutput=output_weights))

sample_w = np.linspace(1.0, 2.0, len(y_true))
print("ours sample_weight:", rmse_np(y_true, y_pred, sample_weight=sample_w))
print("sk   sample_weight:", root_mean_squared_error(y_true, y_pred, sample_weight=sample_w))
ours  raw: [1.13490088 2.22890107]
sk    raw: [1.13490088 2.22890107]
ours weighted outputs: 1.955401025072826
sk   weighted outputs: 1.955401025072826
ours sample_weight: 1.2794530081249567
sk   sample_weight: 1.2794530081249567

12) Pros, cons, and when to use RMSE#

Pros

  • Same units as the target → easy to interpret

  • Differentiable wherever RMSE \(> 0\) (unlike MAE, which is non-differentiable at zero residual) and convex for linear models

  • Strongly penalizes large errors → good when big misses are especially costly

  • Closely tied to least squares / Gaussian noise assumptions

Cons

  • Sensitive to outliers and heavy-tailed noise

  • Scale-dependent: RMSE values are not comparable across targets with different units/scales

  • Can hide systematic bias unless you also inspect residuals (RMSE is a single number)

Good default when

  • You have a regression problem and care more about large errors than small ones

  • Residuals are roughly symmetric and not extremely heavy-tailed

13) Common pitfalls and diagnostics#

  • Always plot residuals: a low RMSE can still mask patterns (non-linearity, heteroscedasticity).

  • Outliers dominate: if your data has rare but huge errors, consider MAE, Huber loss, or quantile losses.

  • Scale matters: if you need comparability, report a normalized RMSE (e.g., divide by target std or range).

  • Skewed targets: if errors are multiplicative (percentage-like), consider RMSLE or modeling on a log scale.
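The RMSLE from the last bullet can be sketched in a few lines (rmsle and rmse here are illustrative helpers; RMSLE needs non-negative targets because of log1p, and scikit-learn ≥ 1.4 offers root_mean_squared_log_error):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def rmsle(y_true, y_pred):
    # RMSE on log1p-transformed values: penalizes relative, percentage-like errors
    return rmse(np.log1p(np.asarray(y_true)), np.log1p(np.asarray(y_pred)))

y_true = np.array([10.0, 100.0, 1000.0])
y_small = np.array([12.0, 100.0, 1000.0])  # +20% miss on the small target
y_big = np.array([10.0, 100.0, 1200.0])    # +20% miss on the big target

# RMSE treats these two misses very differently; RMSLE treats them similarly
print("RMSE :", rmse(y_true, y_small), rmse(y_true, y_big))
print("RMSLE:", rmsle(y_true, y_small), rmsle(y_true, y_big))
```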

y_hat_te = predict_linear(x_te, w_ls, b_ls)
resid_te = y_hat_te - y_te

fig = px.scatter(
    x=x_te,
    y=resid_te,
    title="Residuals on test set (least squares fit)",
    labels={"x": "x", "y": "residual (y_pred - y_true)"},
)
fig.add_hline(y=0, line_dash="dash")
fig.show()

fig = px.histogram(
    x=resid_te,
    nbins=25,
    title="Residual distribution on test set",
    labels={"x": "residual (y_pred - y_true)", "count": "count"},
)
fig.show()

Exercises#

  1. Implement normalized RMSE: divide RMSE by (y_true.max() - y_true.min()) or y_true.std().

  2. Add a single extreme outlier to the regression dataset and compare the fitted line under RMSE/MSE vs MAE.

  3. Implement a finite-difference gradient check for the RMSE gradient.

References#

  • scikit-learn metric: https://scikit-learn.org/stable/api/sklearn.metrics.html

  • scikit-learn User Guide (mean squared error): https://scikit-learn.org/stable/modules/model_evaluation.html

  • The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) — least squares and regression basics